Identifying Semantically Deviating Outlier Documents

Authors

  • Honglei Zhuang
  • Chi Wang
  • Fangbo Tao
  • Lance M. Kaplan
  • Jiawei Han
Abstract

A document outlier is a document whose semantics deviate substantially from those of the majority of documents in a corpus. Automatic identification of document outliers can be valuable in many applications, such as screening health records for medical mistakes. In this paper, we study the problem of mining semantically deviating document outliers in a given corpus. We develop a generative model that identifies frequent and characteristic semantic regions in the word embedding space to represent the given corpus, together with a robust outlierness measure that is resistant to noisy content in documents. Experiments conducted on two real-world textual data sets show that our method can achieve up to a 135% improvement over baselines in terms of recall at the top 1% of the outlier ranking.
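
The evaluation metric named in the abstract, recall at the top 1% of the outlier ranking, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `recall_at_top_fraction`, the rounding of the cutoff, and the toy data are assumptions made here for illustration only.

```python
def recall_at_top_fraction(scores, is_outlier, fraction=0.01):
    """Share of true outliers that appear in the top `fraction` of the ranking.

    scores:     outlierness score per document (higher = more outlying)
    is_outlier: ground-truth flag per document
    fraction:   portion of the ranking to inspect (0.01 = top 1%)
    """
    if len(scores) != len(is_outlier):
        raise ValueError("scores and is_outlier must have the same length")
    total_outliers = sum(1 for flag in is_outlier if flag)
    if total_outliers == 0:
        raise ValueError("recall is undefined without any true outliers")

    # Assumption: the cutoff is rounded to the nearest integer, at least one.
    cutoff = max(1, round(fraction * len(scores)))

    # Rank document indices by descending outlierness score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    hits = sum(1 for i in ranked[:cutoff] if is_outlier[i])
    return hits / total_outliers


# Toy data (hypothetical): 200 documents, 4 true outliers.
scores = [float(i) for i in range(200)]
labels = [i in (10, 50, 198, 199) for i in range(200)]
print(recall_at_top_fraction(scores, labels, fraction=0.01))
```

On this toy ranking the top 1% covers two documents, both true outliers, so two of the four outliers are recovered. A relative improvement such as the one reported in the abstract would compare this value between a method and a baseline.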

Similar resources

Semantic Tagging of Domain-Specific Text Documents with DIAsDEM

Large volumes of electronically available information are stored in textual form. Extracting semantics from these documents and characterizing their contents in a database-like schema is a necessary prerequisite for efficient search and for the fusion of documents that semantically belong together, be they documents about the same company, police reports or legal attests related ...

OutRules: A Framework for Outlier Descriptions in Multiple Context Spaces

Analyzing exceptional objects is an important mining task. It includes the identification of outliers but also the description of outlier properties in contrast to regular objects. However, existing detection approaches fail to provide the descriptions that would let humans understand why an object is an outlier. In this work we present OutRules, a framework for outlier descriptions that enable an e...

Comparative Analysis of Outlier Detection Techniques

Data mining refers to the extraction of interesting patterns from massive data sets. Outlier detection is an important aspect of data mining that finds observations deviating from commonly expected behavior. Outlier detection and analysis is sometimes known as outlier mining. In this paper, we have tried to provide the broad and ...

Outlier Document Filtering Applied to the Extractive Summarization

Summarization requires selecting the more informative sentences within a set of documents. Generally, the process assumes that the document set covers topics related to a single subject. However, some of the documents may be outliers, and an outlier document might reduce the quality of the extractive summary. This research focuses on filtering out outlier documents at the extraction stage. Ext...

DeepDetect: An Extensible System for Detecting Attribute Outliers & Duplicates in XML

XML, the eXtensible Markup Language, is fast evolving into the new standard for data representation and exchange on the WWW. This has resulted in a growing number of data cleaning techniques to locate “dirty” data (artifacts). In this paper, we present DeepDetect – an extensible system that detects attribute outliers and duplicates in XML documents. Attribute outlier detection finds objects tha...

Publication date: 2017